Biological sequence compression algorithms.

Authors

T Matsumoto

K Sadakane

H Imai

Abstract

More and more DNA sequences are becoming available, and the information about them is stored in molecular biology databases. These databases will keep growing in size and importance, so this information must be stored and communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression tools such as gzip or compress cannot compress DNA sequences; they only expand them. On the other hand, CTW (the Context Tree Weighting method) can compress DNA sequences to less than two bits per symbol. These general-purpose algorithms do not exploit the special structures of biological sequences. Two characteristic structures of DNA sequences are known: one is the palindrome, or reverse complement, and the other is the approximate repeat. Several algorithms designed specifically for DNA use these structures and can compress sequences to less than two bits per symbol. In this paper, we extend CTW so that it can exploit these characteristic structures. Before encoding the next symbol, the algorithm searches for an approximate repeat or palindrome using hashing and dynamic programming. If it finds a palindrome or an approximate repeat that is long enough, the algorithm represents it by its length and distance. With this preprocessing, the new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
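The repeat and palindrome search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it finds exact matches only, using a hash index of k-mers, and leaves out both the dynamic programming step for approximate repeats and the CTW coding itself. The function names, the seed length `k=4`, and the convention of measuring a palindrome's distance back to the last base of the earlier segment are all assumptions made for illustration.

```python
from collections import defaultdict

# Watson-Crick complement table for the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def _kmer_index(seq, pos, k):
    """Hash every k-mer that ends at or before position pos to its start positions."""
    index = defaultdict(list)
    for i in range(pos - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def longest_repeat(seq, pos, k=4):
    """Longest exact earlier copy of seq[pos:], as (length, distance), or None."""
    seed = seq[pos:pos + k]
    if len(seed) < k:
        return None
    best = None
    for start in _kmer_index(seq, pos, k).get(seed, []):
        length = k
        # Extend the seed match forward while the symbols keep agreeing.
        while pos + length < len(seq) and seq[start + length] == seq[pos + length]:
            length += 1
        if best is None or length > best[0]:
            best = (length, pos - start)
    return best


def longest_palindrome(seq, pos, k=4):
    """Longest earlier segment whose reverse complement equals seq[pos:].

    Returns (length, distance), where distance counts back from pos to the
    last base of the earlier segment, or None if no seed match exists.
    """
    seed = seq[pos:pos + k]
    if len(seed) < k:
        return None
    best = None
    for start in _kmer_index(seq, pos, k).get(reverse_complement(seed), []):
        end = start + k - 1  # last base of the earlier k-mer
        length = k
        # Read forward from pos while walking backward from end.
        while (end - length >= 0 and pos + length < len(seq)
               and seq[pos + length] == reverse_complement(seq[end - length])):
            length += 1
        if best is None or length > best[0]:
            best = (length, pos - end)
    return best
```

For example, in `"ACGTTGCAACGTTGCA"` the second half is an exact copy of the first, so `longest_repeat(seq, 8)` yields length 8 at distance 8. In `"AACCGTACGGTT"` the last six bases are the reverse complement of the first six, so `longest_palindrome(seq, 6)` yields length 6 at distance 1. An encoder in the spirit of the abstract would emit such a pair only when the match is long enough and fall back to symbol-by-symbol coding otherwise.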


Similar resources

An Efficient Biological Sequence Compression Technique Using LUT And Repeat In The Sequence

Data compression plays an important role in dealing with high volumes of DNA sequences in the field of bioinformatics. Data compression techniques also directly affect the alignment of DNA sequences, so the time needed to decompress a compressed sequence has to be given the same priority as the compression ratio. This article contains first an introduction, then a brief review of different biological...


Using sequence compression to speedup probabilistic profile matching

MOTIVATION: Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) to match a sequence of N characters against a...


A Biological sequence Compression based on Look up Table (LUT) using Complementary Palindrome of Fixed size

Data storage costs account for an appreciable proportion of the total cost of creating and analyzing DNA sequences. In particular, the growth in DNA sequence data is remarkable compared with the growth in disk storage capacity. General text compression algorithms do not utilize the specific characteristics of DNA sequences. In this paper we propose a compression algorithm based on...


A Lossy Compression Technique Enabling Duplication-Aware Sequence Alignment

In spite of the recognized importance of tandem duplications in genome evolution, commonly adopted sequence comparison algorithms do not take into account complex mutation events involving more than one residue at a time, since they are not compliant with the underlying assumption of statistical independence of adjacent residues. As a consequence, the presence of tandem repeats in sequences u...


DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database

Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information theory are often perceived as being of interest for data communication and storage. In recent years, a substantial effort has been made for the application of tex...


Journal title:

Genome informatics. Workshop on Genome Informatics

Volume 11

Publication date: 2000
